Cheat Sheet — Bias-Variance Tradeoff


What is Bias? $\text{bias} = E[f'(x)] - f(x)$
• Error between average model prediction and ground truth
• The bias of the estimated function tells us the capacity of the underlying model to
predict the values
What is Variance? $\text{variance} = E\left[(f'(x) - E[f'(x)])^2\right]$
• Average variability in the model prediction for the given dataset
• The variance of the estimated function tells you how much the function can adjust
to the change in the dataset


High Bias → Overly-simplified Model
          → Under-fitting
          → High error on both test and train data


High Variance → Overly-complex Model
              → Over-fitting
              → Low error on train data and high on test
              → Starts modelling the noise in the input


[Figure: error versus model complexity, with a variance curve and a marked minimum-error point. Left (high bias, low variance): under-fitting, preferred if size of dataset is small. Middle: just right (minimum error). Right (low bias, high variance): over-fitting, preferred if size of dataset is large.]


Bias variance Trade-off
• Increasing bias (not always) reduces variance and vice-versa
• $\text{Error} = \text{bias}^2 + \text{variance} + \text{irreducible error}$
• The best model is where the total error is minimized.
• Compromise between bias and variance
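The trade-off can be checked numerically. The following is a minimal sketch, not part of the original sheet: the true function `x * x`, the noise level, the evaluation point `x0 = 0.9`, and the two toy predictors are all illustrative assumptions. It estimates bias² and variance of each predictor by retraining on many random datasets.

```python
import random
import statistics

def true_f(x):
    return x * x  # assumed ground-truth function for this illustration

def make_dataset(rng, n=20, noise_sd=0.3):
    """Draw n noisy samples (x, y) with y = true_f(x) + Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def predict_constant(data, x0):
    """Overly simple model: ignore x0 and predict the mean training label."""
    return statistics.mean(y for _, y in data)

def predict_nearest(data, x0):
    """Overly flexible model: copy the label of the nearest training point."""
    return min(data, key=lambda p: abs(p[0] - x0))[1]

def bias_variance(predict, x0=0.9, trials=500, seed=0):
    """Estimate bias^2 and variance of a predictor at x0 over many datasets."""
    rng = random.Random(seed)
    preds = [predict(make_dataset(rng), x0) for _ in range(trials)]
    bias = statistics.mean(preds) - true_f(x0)
    return bias ** 2, statistics.pvariance(preds)

bias2_simple, var_simple = bias_variance(predict_constant)
bias2_flexible, var_flexible = bias_variance(predict_nearest)
print(f"simple:   bias^2={bias2_simple:.3f} variance={var_simple:.3f}")
print(f"flexible: bias^2={bias2_flexible:.3f} variance={var_flexible:.3f}")
```

The constant predictor should show the larger bias² and the nearest-neighbour predictor the larger variance, which mirrors the under-fitting and over-fitting ends of the curve.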


Source: https://www.cheatsheets.aqeel-anwar.com


Cheat Sheet — Imbalanced Data in Classification 


$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$


Classifier that always predicts label blue yields prediction accuracy of 90% 


Accuracy doesn’t always give the correct insight about your trained model 


Metric | What it measures | Definition | Scope
Accuracy | %age correct prediction | Correct prediction over total predictions | One value for entire network
Precision | Exactness of model | From the detected cats, how many were actually cats | Each class/label has a value
Recall | Completeness of model | Correctly detected cats over total cats | Each class/label has a value
F1 Score | Combines Precision/Recall | Harmonic mean of Precision and Recall | Each class/label has a value


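Performance metrics associated with Class 1

Naming: the first word (True/False) answers "Is your prediction correct?", the second word (Positive/Negative) answers "What did you predict?". For example, True Negative means your prediction is correct and you predicted 0.

 | Actual label 1 | Actual label 0
Predicted 1 | True Positive (TP) | False Positive (FP)
Predicted 0 | False Negative (FN) | True Negative (TN)

$\text{Precision} = \frac{TP}{TP + FP}$
$\text{Recall, Sensitivity, True +ve rate} = \frac{TP}{TP + FN}$
$\text{Specificity} = \frac{TN}{TN + FP}$
$\text{False +ve rate} = \frac{FP}{TN + FP}$
$\text{F1 score} = 2 \times \frac{\text{Prec} \times \text{Rec}}{\text{Prec} + \text{Rec}}$
$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$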
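As a small sketch of these metrics, the function below computes them from the four confusion-matrix counts. The function name, the choice to return 0 for undefined ratios, and treating the minority class as the positive class in the example are assumptions made for illustration only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute confusion-matrix metrics for the positive class."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumption: report 0 when undefined

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# 90 majority samples and 10 minority samples; the classifier always predicts
# the majority class, so every minority (positive) sample is a false negative.
always_majority = classification_metrics(tp=0, fp=0, fn=10, tn=90)
print(always_majority)
```

Accuracy comes out at 90% while recall and F1 for the minority class are 0, which is the sense in which accuracy alone can mislead on imbalanced data.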


Possible solutions 
1. Data Replication: Replicate the available data until the number of samples per class is comparable
2. Synthetic Data: Images: Rotate, dilate, crop, add noise to existing input images and create new data
3. Modified Loss: Modify the loss to reflect greater error when misclassifying the smaller sample set
$\text{loss} = a \cdot \text{loss}_{green} + b \cdot \text{loss}_{blue}, \quad a > b$
4. Change the algorithm: Increase the model/algorithm complexity so that the two classes are perfectly
separable (Con: Overfitting)
[Figures: example datasets before and after replication and augmentation (Blue: Label 1, Green: Label 0)]


Increase model complexity:
Before: No straight line (y=ax) passing through origin can perfectly separate data. Best solution: line y=0, predict all labels blue
After: Straight line (y=ax+b) can perfectly separate data. Green class will no longer be predicted as blue
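The modified-loss idea (solution 3) can be sketched as a class-weighted binary cross-entropy. This is an illustrative assumption about how the weights `a` and `b` might be applied. The function name, the clipping constant, and the choice of label 0 as the minority class are not from the original sheet.

```python
import math

def weighted_bce(y_true, p_pred, minority_weight, majority_weight, minority_label=0):
    """Binary cross-entropy where each sample is weighted by its class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip to avoid log(0)
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        weight = minority_weight if y == minority_label else majority_weight
        total += weight * sample_loss
    return total / len(y_true)

labels = [1, 1, 1, 0]            # three majority samples, one minority sample
always_one = [0.9, 0.9, 0.9, 0.9]  # a model that (almost) always predicts label 1
plain = weighted_bce(labels, always_one, minority_weight=1.0, majority_weight=1.0)
weighted = weighted_bce(labels, always_one, minority_weight=5.0, majority_weight=1.0)
```

With a > b the misclassified minority sample dominates the loss, so a model that ignores the minority class is penalized more heavily than under the unweighted loss.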




Cheat Sheet — PCA Dimensionality Reduction 


What is PCA? 

• Based on the dataset, find a new set of orthogonal feature vectors such that the
data spread is maximum in the direction of the feature vector (or dimension)

• Ranks the feature vectors in decreasing order of data spread (or variance)

• The datapoints have maximum variance along the first feature vector, and minimum variance
along the last feature vector

• The variance of the datapoints in the direction of a feature vector can be treated as a
measure of information in that direction.


Steps
1. Standardize the datapoints: $X_{new} = \frac{X - \text{mean}(X)}{\text{std}(X)}$

2. Find the covariance matrix from the given datapoints: $C[i, j] = \text{cov}(x_i, x_j)$
3. Carry out eigen-value decomposition of the covariance matrix: $C = V \Sigma V^{-1}$
4. Sort the eigenvalues and eigenvectors: $\Sigma_{sort} = \text{sort}(\Sigma)$, $V_{sort} = \text{sort}(V, \Sigma_{sort})$


Dimensionality Reduction with PCA 

• Keep the first m out of n feature vectors ranked by PCA. These m vectors will be the best m
vectors preserving the maximum information that could have been preserved with m
vectors on the given dataset

Steps:

1. Carry out steps 1-4 from above

2. Keep first m feature vectors from the sorted eigenvector matrix: $V_{reduced} = V[:, 0:m]$

3. Transform the data for the new basis (feature vectors): $X_{reduced} = X_{new} \times V_{reduced}$

4. The importance of the feature vector is proportional to the magnitude of the eigen value
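The steps above can be sketched for the 2-D case using only the standard library. The closed-form eigen-decomposition of a symmetric 2x2 matrix, the function names, and the sample points are assumptions chosen to keep the example self-contained. A general implementation would use a linear-algebra library instead.

```python
import math
import statistics

def standardize(values):
    """Step 1: subtract the mean and divide by the (population) standard deviation."""
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def pca_2d(points, m=1):
    """Run steps 1-4 on 2-D points, then project onto the first m eigenvectors."""
    xs = standardize([p[0] for p in points])
    ys = standardize([p[1] for p in points])
    n = len(points)
    # Step 2: entries of the symmetric covariance matrix C = [[a, b], [b, d]]
    a = sum(x * x for x in xs) / n
    d = sum(y * y for y in ys) / n
    b = sum(x * y for x, y in zip(xs, ys)) / n
    # Step 3: closed-form eigenvalues of a symmetric 2x2 matrix
    center, radius = (a + d) / 2, math.hypot((a - d) / 2, b)
    # Step 4: listed in decreasing order, so they are already sorted
    eigenvalues = [center + radius, center - radius]
    if abs(b) < 1e-12:
        # Uncorrelated features: the coordinate axes are the eigenvectors
        eigenvectors = [(1.0, 0.0), (0.0, 1.0)] if a >= d else [(0.0, 1.0), (1.0, 0.0)]
    else:
        eigenvectors = []
        for lam in eigenvalues:
            vx, vy = b, lam - a  # solves (C - lam * I) v = 0
            norm = math.hypot(vx, vy)
            eigenvectors.append((vx / norm, vy / norm))
    v_reduced = eigenvectors[:m]
    # X_reduced = X_new x V_reduced
    x_reduced = [[x * vx + y * vy for vx, vy in v_reduced] for x, y in zip(xs, ys)]
    return eigenvalues, x_reduced

points = [(0, 0.1), (1, 2.0), (2, 3.9), (3, 6.2), (4, 8.0)]
eigenvalues, x_reduced = pca_2d(points, m=1)
explained = eigenvalues[0] / sum(eigenvalues)
```

For these nearly collinear points almost all of the variance falls on the first component, so keeping m = 1 loses very little information.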


Figure 1: Datapoints with feature vectors as x and y-axis
Figure 2: The cartesian coordinate system is rotated to maximize the standard deviation
along any one axis (new feature # 2)
Figure 3: Remove the feature vector with minimum standard deviation of datapoints
(new feature # 1) and project the data on new feature # 2




Cheat Sheet — Bayes Theorem and Classifier 


What is Bayes’ Theorem? 
• Describes the probability of an event, based on prior knowledge of conditions that might be
related to the event.

$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$
where P(A|B) is the posterior probability, P(B|A) the likelihood, P(A) the prior, and P(B) the evidence.

• How the probability of an event changes when we have knowledge of another event:
P(A) → P(A|B), which is usually a better estimate than P(A)
Example:
• Probability of fire P(F) = 1%
• Probability of smoke P(S) = 10%
• Prob of smoke given there is a fire P(S|F) = 90%
• What is the probability that there is a fire given we see smoke, P(F|S)?
$P(F|S) = \frac{P(S|F) \times P(F)}{P(S)} = \frac{0.9 \times 0.01}{0.1} = 9\%$
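The fire and smoke example can be reproduced in a couple of lines. The function name `posterior` is a hypothetical helper introduced only for this sketch.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# P(S|F) = 0.9, P(F) = 0.01, P(S) = 0.1
p_fire_given_smoke = posterior(likelihood=0.9, prior=0.01, evidence=0.1)
print(f"P(F|S) = {p_fire_given_smoke:.2%}")
```

Seeing smoke raises the probability of fire from the 1% prior to roughly 9%.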
Maximum A Posteriori Probability (MAP) Estimation
The MAP estimate of the random variable y, given that we have observed iid $(x_1, x_2, x_3, \dots)$, is
given by
$\hat{y} = \arg\max_y P(y) \prod_i P(x_i|y)$
i.e. the y that maximizes the product of prior and likelihood. We try to accommodate our prior knowledge when estimating.
Maximum Likelihood Estimation (MLE)
The MLE estimate of the random variable y, given that we have observed iid $(x_1, x_2, x_3, \dots)$, is
given by
$\hat{y} = \arg\max_y \prod_i P(x_i|y)$
i.e. the y that maximizes only the likelihood. We assume we don't have any prior knowledge of the quantity being estimated.


MLE is a special case of MAP where our prior is uniform (all values are equally likely)
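A minimal sketch of the difference, assuming a coin whose heads probability y is restricted to a small grid of candidate values. The grid, the coin-flip likelihood, and the particular "fair coin" prior are all illustrative assumptions.

```python
import math

CANDIDATES = [i / 10 for i in range(1, 10)]  # candidate values of y: 0.1 ... 0.9

def log_likelihood(y, observations):
    """log prod_i P(x_i | y) for iid coin flips x_i in {0, 1} with P(x_i = 1) = y."""
    return sum(math.log(y if x == 1 else 1 - y) for x in observations)

def mle(observations):
    """Pick the y that maximizes only the likelihood."""
    return max(CANDIDATES, key=lambda y: log_likelihood(y, observations))

def map_estimate(observations, prior):
    """Pick the y that maximizes prior times likelihood (prior may be unnormalized)."""
    return max(CANDIDATES, key=lambda y: math.log(prior[y]) + log_likelihood(y, observations))

flips = [1, 1, 1]  # three heads in a row
uniform_prior = {y: 1.0 for y in CANDIDATES}
fair_coin_prior = {y: (y * (1 - y)) ** 5 for y in CANDIDATES}  # peaked at y = 0.5

y_mle = mle(flips)
y_map_uniform = map_estimate(flips, uniform_prior)
y_map_fair = map_estimate(flips, fair_coin_prior)
```

With a uniform prior the MAP estimate coincides with the MLE, while the prior peaked at 0.5 pulls the MAP estimate back toward a fair coin.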


Naive Bayes' Classifier (Instantiation of MAP as classifier)
Suppose we have two classes, $y = y_1$ and $y = y_2$. Say we have more than one evidence/feature $(x_1, x_2, x_3, \dots)$. Using Bayes' theorem,

$P(y|x_1, x_2, x_3, \dots) = \frac{P(x_1, x_2, x_3, \dots|y) \times P(y)}{P(x_1, x_2, x_3, \dots)}$

Naive Bayes assumes the features $(x_1, x_2, \dots)$ are independent given the class, i.e. $P(x_1, x_2, x_3, \dots|y) = \prod_i P(x_i|y)$, so

$P(y|x_1, x_2, x_3, \dots) = \frac{\prod_i P(x_i|y) \times P(y)}{P(x_1, x_2, x_3, \dots)}$

Decision rule:

$\hat{y} = y_1$ if $\frac{P(y_1|x_1, x_2, x_3, \dots)}{P(y_2|x_1, x_2, x_3, \dots)} > 1$ else $\hat{y} = y_2$
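The decision rule can be sketched for binary features as follows. The class names, priors, and per-feature probabilities are made-up numbers for illustration, and the helper names are hypothetical.

```python
def class_score(x, prior, feature_probs):
    """P(y) * prod_i P(x_i | y) for binary features, with feature_probs[i] = P(x_i = 1 | y)."""
    score = prior
    for value, p in zip(x, feature_probs):
        score *= p if value == 1 else 1 - p
    return score

def naive_bayes_predict(x, priors, likelihoods):
    """Return 'y1' if P(y1 | x) / P(y2 | x) > 1, else 'y2' (the evidence P(x) cancels)."""
    s1 = class_score(x, priors["y1"], likelihoods["y1"])
    s2 = class_score(x, priors["y2"], likelihoods["y2"])
    return "y1" if s1 > s2 else "y2"

priors = {"y1": 0.5, "y2": 0.5}
likelihoods = {"y1": [0.9, 0.2], "y2": [0.3, 0.7]}  # assumed P(x_i = 1 | y)

first = naive_bayes_predict([1, 0], priors, likelihoods)
second = naive_bayes_predict([0, 1], priors, likelihoods)
```

Because both posteriors share the same denominator, comparing the numerators is enough to evaluate the ratio test.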




Cheat Sheet — Regression Analysis 


What is Regression Analysis? 
Fitting a function f(.) to datapoints $y_i = f(x_i)$ under some error function. Based on the estimated
function and error, we have the following types of regression


1. Linear Regression:
Fits a line minimizing the sum of mean-squared error for each datapoint.
$\min_\beta \sum_i \lVert y_i - f_\beta^{linear}(x_i) \rVert^2$, with $f_\beta^{linear}(x_i) = \beta_0 + \beta_1 x_i$

2. Polynomial Regression:
Fits a polynomial of order k (k+1 unknowns) minimizing the sum of mean-squared error for each datapoint.
$\min_\beta \sum_{i=0}^{m} \lVert y_i - f_\beta^{poly}(x_i) \rVert^2$, with $f_\beta^{poly}(x_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_k x_i^k$

3. Bayesian Regression:
For each datapoint, fits a gaussian distribution by minimizing the mean-squared error. As the number of
data points $x_i$ increases, it converges to point estimates, i.e. $n \to \infty$, $\sigma^2 \to 0$.
$\min_\beta \sum_i \lVert y_i - \mathcal{N}(f_\beta(x_i), \sigma^2) \rVert^2$, with $f_\beta(x_i) = f_\beta^{poly}(x_i)$ or $f_\beta^{linear}(x_i)$
$\mathcal{N}(\mu, \sigma^2)$ → Gaussian with mean $\mu$ and variance $\sigma^2$

4. Ridge Regression:
Can fit either a line, or polynomial minimizing the sum of mean-squared error for each datapoint and the
weighted L2 norm of the function parameters beta.
$\min_\beta \sum_{i=0}^{m} \lVert y_i - f_\beta(x_i) \rVert^2 + \sum_{j=0}^{n} \beta_j^2$, with $f_\beta(x_i) = f_\beta^{poly}(x_i)$ or $f_\beta^{linear}(x_i)$

5. LASSO Regression:
Can fit either a line, or polynomial minimizing the sum of mean-squared error for each datapoint and the
weighted L1 norm of the function parameters beta.
$\min_\beta \sum_{i=0}^{m} \lVert y_i - f_\beta(x_i) \rVert^2 + \sum_{j=0}^{n} \lvert \beta_j \rvert$, with $f_\beta(x_i) = f_\beta^{poly}(x_i)$ or $f_\beta^{linear}(x_i)$

6. Logistic Regression:
Can fit either a line, or polynomial with sigmoid activation minimizing the binary cross-entropy loss for
each datapoint. The labels y are binary class labels.
$\min_\beta \sum_i -y_i \log(\sigma(f_\beta(x_i))) - (1 - y_i) \log(1 - \sigma(f_\beta(x_i)))$, with $f_\beta(x_i) = f_\beta^{poly}(x_i)$ or $f_\beta^{linear}(x_i)$ and $\sigma(t) = \frac{1}{1 + e^{-t}}$
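For the simplest case (a line through one-dimensional data), the least-squares minimizer has a closed form. The sketch below uses that standard result. The function names and the sample data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of beta_0 and beta_1 for f(x) = beta_0 + beta_1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta_1 = sxy / sxx
    beta_0 = mean_y - beta_1 * mean_x
    return beta_0, beta_1

def squared_error(xs, ys, beta_0, beta_1):
    """Sum of squared residuals, the quantity being minimized."""
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x
beta_0, beta_1 = fit_line(xs, ys)
residual = squared_error(xs, ys, beta_0, beta_1)
```

Ridge and LASSO would add the corresponding penalty on the betas to `squared_error` before minimizing. Those variants generally need an iterative or matrix-based solver and are not shown here.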


Visual Representation:
[Figure panels: Linear Regression, Polynomial Regression, Bayesian Linear Regression, Logistic Regression]


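Type | What does it fit? | Estimated function | Error function
Linear | A line in n dimensions | $f_\beta^{linear}(x_i) = \beta_0 + \beta_1 x_i$ | $\sum_i \lVert y_i - f_\beta(x_i) \rVert^2$
Polynomial | A polynomial of order k | $f_\beta^{poly}(x_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots$ | $\sum_i \lVert y_i - f_\beta(x_i) \rVert^2$
Bayesian Linear | Gaussian distribution for each point | $\mathcal{N}(f_\beta(x_i), \sigma^2)$ | $\sum_i \lVert y_i - \mathcal{N}(f_\beta(x_i), \sigma^2) \rVert^2$
Ridge | Linear/polynomial | $f_\beta^{linear}(x_i)$ or $f_\beta^{poly}(x_i)$ | $\sum_i \lVert y_i - f_\beta(x_i) \rVert^2 + \sum_j \beta_j^2$
LASSO | Linear/polynomial | $f_\beta^{linear}(x_i)$ or $f_\beta^{poly}(x_i)$ | $\sum_i \lVert y_i - f_\beta(x_i) \rVert^2 + \sum_j \lvert \beta_j \rvert$
Logistic | Linear/polynomial with sigmoid | $\sigma(f_\beta(x_i))$ | $\sum_i -y_i \log(\sigma(f_\beta(x_i))) - (1 - y_i) \log(1 - \sigma(f_\beta(x_i)))$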




Cheat Sheet — Regularization in ML


What is Regularization in ML?
• Regularization is an approach to address over-fitting in ML.
• Overfitted model fails to generalize estimations on test data
• When the underlying model to be learned is low bias/high variance, or when we have small amount of data, the
estimated model is prone to over-fitting.
• Regularization reduces the variance of the model

[Figure 1. Overfitting: under-fitting, just right, over-fitting]


Types of Regularization:


1. Modify the loss function:
• L2 Regularization: Prevents the weights from getting too large (defined by L2 norm). The larger
the weights, the more complex the model is, and the higher the chance of overfitting.

$\text{loss} = \text{error}(y, \hat{y}) + \lambda \sum_j \beta_j^2$, $\lambda \geq 0$, $\lambda \propto \text{model bias}$, $\lambda \propto \frac{1}{\text{model variance}}$

• L1 Regularization: Prevents the weights from getting too large (defined by L1 norm). The larger
the weights, the more complex the model is, and the higher the chance of overfitting. L1 regularization
introduces sparsity in the weights. It forces more weights to be zero, rather than reducing the
average magnitude of all weights.

$\text{loss} = \text{error}(y, \hat{y}) + \lambda \sum_j \lvert \beta_j \rvert$, $\lambda \geq 0$, $\lambda \propto \text{model bias}$, $\lambda \propto \frac{1}{\text{model variance}}$

• Entropy: Used for the models that output probability. Forces the probability distribution
towards uniform distribution.

$\text{loss} = \text{error}(p, \hat{p}) - \lambda \sum_i \hat{p}_i \log(\hat{p}_i)$, $\lambda \geq 0$, $\lambda \propto \text{model bias}$, $\lambda \propto \frac{1}{\text{model variance}}$
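A minimal sketch of the L1 and L2 penalty terms, assuming the weights are given as a plain list and `lam` plays the role of λ. The function names and sample weights are illustrative only.

```python
def l2_penalty(weights, lam):
    """lam * sum_j beta_j^2"""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """lam * sum_j |beta_j|"""
    return lam * sum(abs(w) for w in weights)

def regularized_loss(error, weights, lam, penalty=l2_penalty):
    """Total loss = data error + regularization penalty."""
    return error + penalty(weights, lam)

weights = [3.0, -4.0]
loss_l2 = regularized_loss(error=1.0, weights=weights, lam=0.1)
loss_l1 = regularized_loss(error=1.0, weights=weights, lam=0.1, penalty=l1_penalty)
```

Setting `lam` to 0 recovers the unregularized error, and larger values trade a higher bias for a lower variance.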


2. Modify data sampling: 

• Data augmentation: Create more data from available data by randomly cropping, dilating,
rotating, adding small amount of noise etc.

• K-fold Cross-validation: Divide the data into k groups. Train on (k-1) groups and test on 1
group. Try all k possible combinations.
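The k-fold procedure can be sketched as an index generator. The function name and the choice to split contiguous index ranges without shuffling are assumptions of this sketch.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(n_samples=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each sample lands in the test set exactly once, and the k test errors are typically averaged to score the model.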


3. Change training approach: 

• Injecting noise: Add random noise to the weights when they are being learned. It pushes the
model to be relatively insensitive to small variations in the weights, hence regularization

• Dropout: Generally used for neural networks. Connections between consecutive layers are
randomly dropped based on a dropout-ratio and the remaining network is trained in the
current iteration. In the next iteration, another set of random connections are dropped.
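Dropout can be sketched as drawing a fresh random mask over the connections in every iteration. The function name and the independent per-connection draw are assumptions. Real frameworks usually also rescale the surviving activations, which is not shown here.

```python
import random

def dropout_mask(n_connections, dropout_ratio, rng):
    """Return a list of booleans: True means the connection stays active this iteration."""
    return [rng.random() >= dropout_ratio for _ in range(n_connections)]

rng = random.Random(0)
mask_iter_1 = dropout_mask(16, 0.3, rng)
mask_iter_2 = dropout_mask(16, 0.3, rng)  # a different random set is dropped next iteration
print(sum(mask_iter_1), "of 16 connections active in iteration 1")
```

With a dropout-ratio of 30%, roughly 70% of the connections stay active on average in each iteration.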


Figure 2. K-fold CV (5-fold cross-validation: each of the five groups is used once as the test set while the remaining four are used for training)
Figure 3. Drop-out (original network: connections = 16; with dropout-ratio = 30%: active = 11, about 70%)




Cheat Sheet — Convolutional Neural Network 


Convolutional Neural Network:
The data gets into the CNN through the input layer and passes through various hidden layers before getting to the output layer.
The output of the network is compared to the actual labels in terms of loss or error. The partial derivatives of this loss w.r.t the
trainable weights are calculated, and the weights are updated through one of the various methods using backpropagation.
[Figure: dataset (x, y) flowing through the network to the loss]

CNN Template:
Most of the commonly used hidden layers (not all) follow a pattern

[Table of building blocks. Layer function: Convolutional, Fully Connected. Pooling: Max Pooling, Average Pooling, Max Un-pooling, Average Un-pooling. Normalization: Batch Normalisation. Activation: Linear, ELU, Tanh. Loss function: MSE (L2 Loss), Huber Loss, Cross Entropy. Optimizer: SGD, Momentum, AdaGrad, RMSProp.]

1. Layer function: Basic transforming function such as convolutional or fully connected layer.

a. Fully Connected: Linear functions between the input and the output

b. Convolutional Layers: These layers are applied to 2D (3D) input feature maps. The trainable weights are a 2D (3D)
kernel/filter that moves across the input feature map, generating dot products with the overlapping region of the input
feature map.

c. Transposed Convolutional (DeConvolutional) Layer: Usually used to increase the size of the output feature map
(Upsampling). The idea behind the transposed convolutional layer is to undo (not exactly) the convolutional layer


[Figure: Fully Connected Layer (input nodes, output nodes); Convolutional Layer and Transposed Convolution (input map, kernel, output map)]


2. Pooling: Non-trainable layer to change the size of the feature map

a. Max/Average Pooling: Decrease the spatial size of the input layer based on
selecting the maximum/average value in receptive field defined by the kernel

b. UnPooling: A non-trainable layer used to increase the spatial size of the input
layer based on placing the input pixel at a certain index in the receptive field
of the output defined by the kernel.

[Figure: pooling example. Type: max pool, Stride: 1, Padding: 1]
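Max and average pooling can be sketched on a small 2-D list. The function name, the default 2x2 kernel with stride 2, and the absence of padding are assumptions of this sketch rather than settings taken from the figure.

```python
def pool2d(feature_map, kernel=2, stride=2, mode="max"):
    """Max or average pooling over a 2-D list (no padding)."""
    reduce = max if mode == "max" else (lambda window: sum(window) / len(window))
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for r in range(0, rows - kernel + 1, stride):
        out_row = []
        for c in range(0, cols - kernel + 1, stride):
            # Collect the receptive field defined by the kernel
            window = [feature_map[r + i][c + j] for i in range(kernel) for j in range(kernel)]
            out_row.append(reduce(window))
        output.append(out_row)
    return output

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [3, 4, 6, 5],
]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
```

A 4x4 map shrinks to 2x2 here. The layer has no trainable weights, only the kernel size and stride as hyperparameters.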


3. Normalization: Usually used just before the activation functions to limit the
unbounded activation from increasing the output layer values too high

a. Local Response Normalization LRN: A non-trainable layer that square-normalizes the pixel values in a feature map
within a local neighborhood.

b. Batch Normalization: A trainable approach to normalizing the data by learning scale and shift variable during training.


4. Activation: Introduce non-linearity so CNN can efficiently map non-linear complex mapping.

a. Non-parametric/Static functions: Linear, ReLU
b. Parametric functions: ELU, tanh, sigmoid, Leaky ReLU
c. Bounded functions: tanh, sigmoid

5. Loss function: Quantifies how far off the CNN prediction is from the actual labels.

a. Regression Loss Functions: MAE, MSE, Huber loss
b. Classification Loss Functions: Cross entropy, Hinge loss

[Activation plots: ELU, $\tanh(\alpha x)$, ReLU $\max(x, 0)$, sigmoid, Leaky ReLU $\max(\beta x, x)$]
[Loss plots: MSE Loss, MAE Loss, Huber Loss, Hinge Loss ($\max(0, 1 - \hat{y})$ for $y = 1$, $\max(0, 1 + \hat{y})$ for $y = -1$), Cross Entropy Loss ($-y \log(p) - (1 - y) \log(1 - p)$)]
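A few of the activation and loss functions named above can be written directly. The parameter defaults (`beta`, `delta`) are arbitrary choices for this sketch, and the Huber loss uses the standard textbook definition because its formula is not legible in the sheet.

```python
import math

def relu(x):
    return max(x, 0.0)

def leaky_relu(x, beta=0.1):
    return max(beta * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hinge_loss(y, score):
    """y is the label in {-1, +1}; score is the raw prediction."""
    return max(0.0, 1.0 - y * score)

def huber_loss(y, y_hat, delta=1.0):
    """Quadratic for small errors, linear for large ones (standard definition)."""
    err = abs(y - y_hat)
    return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)

def cross_entropy(y, p):
    """Binary cross-entropy for a label y in {0, 1} and predicted probability p."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)
```

The hinge loss is written as a single expression `max(0, 1 - y * score)`, which covers both the y = 1 and y = -1 cases.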




Cheat Sheet — Famous CNNs 


AlexNet — 2012
Why: AlexNet was born out of the need to improve the results of the ImageNet challenge
What: The network consists of 5 Convolutional (CONV) layers and 3 Fully Connected (FC) layers. The activation used is the Rectified
Linear Unit (ReLU).
How: Data augmentation is carried out to reduce over-fitting. Uses Local Response Normalization.
[Table: AlexNet Network - Structural Details (per layer: input size, output size, layer, stride, pad, kernel size, channels in/out, # of params). Legible rows include conv1 (stride 4, pad 0, 3 in, 96 out, 34944 params) and conv3 (stride 1, pad 1, 3x3 kernel, 256 in, 384 out, 885120 params).]

VGGNet — 2014
Why: VGGNet was born out of the need to reduce the # of parameters in the CONV layers and improve on training time
What: There are multiple variants of VGGNet (VGG16, VGG19, etc.)
How: The important point to note here is that all the conv kernels are of size 3x3 and maxpool kernels are of size 2x2 with a stride of two.
[Table: VGG16 - Structural Details (same columns). Legible rows include 3x3 convolutions with 64 in/64 out (36928 params), 128 in/128 out (147584), 256 in/256 out (590080), 512 in/512 out (2359808), and fully connected layers 4096 to 4096 (16781312) and 4096 to 1000 (4097000). Total: 138,423,208 parameters.]
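Several parameter counts quoted in the structural-details tables (for example 34944, 36928 and 4097000) can be reproduced with the usual weights-plus-biases formula. The helper names are hypothetical, and the 11x11 kernel for AlexNet's first convolution is an assumption based on the standard architecture, since that cell is not legible in the table.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights of every filter plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

def fc_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features

alexnet_conv1 = conv_params(11, 11, 3, 96)   # assumed 11x11 kernel, 3 in, 96 out
alexnet_conv3 = conv_params(3, 3, 256, 384)  # 3x3 kernel, 256 in, 384 out
vgg_conv_64 = conv_params(3, 3, 64, 64)      # 3x3 kernel, 64 in, 64 out
vgg_fc_last = fc_params(4096, 1000)          # final fully connected layer
```

The fully connected layers dominate the total: a single 4096 to 4096 layer holds more parameters than all of the 3x3 convolutions listed above combined.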


ResNet — 2015
Why: Neural Networks are notorious for not being able to find a simpler mapping when it exists. ResNet solves that.
What: There are multiple versions of ResNetXX architectures where 'XX' denotes the number of layers. The most used ones are ResNet50
and ResNet101. Since the vanishing gradient problem was taken care of (more about it in the How part), CNNs started to get deeper and deeper.
How: ResNet architecture makes use of shortcut connections to solve the vanishing gradient problem. The basic building block of ResNet is
a Residual block that is repeated throughout the network.
[Table: ResNet18 - Structural Details (per layer: input size, output size, layer, stride, pad, kernel size)]
Figure 1: ResNet Block (weight layers compute f(x), and the shortcut connection adds the input to give f(x) + x)
Figure 2: Inception Block (previous layer feeding parallel branches such as 1x1 conv and maxpool, joined by filter concatenation)
Inception — 2014
Why: Larger kernels are preferred for more global features; on the other hand, smaller kernels provide good results in detecting area-specific
features. For effective recognition of such variable-sized features, we need kernels of different sizes. That is what Inception does.
What: The Inception network architecture consists of several inception modules of the following structure. Each inception module consists of
four operations in parallel: 1x1 conv layer, 3x3 conv layer, 5x5 conv layer, max pooling
How: Inception increases the network space from which the best network is to be chosen via training. Each inception module can
capture salient features at different levels.
Comparison
Network | Year | Salient Feature | top5 accuracy | Parameters | FLOP
AlexNet | 2012 | Deeper | 84.70% | 62M | 1.5B
VGGNet | 2014 | Fixed-size kernels | 92.30% | 138M | 19.6B
Inception | 2014 | Wider - Parallel kernels | 93.30% | 6.4M | 2B
ResNet-152 | 2015 | Shortcut connections | 95.51% | 60.3M | 11B




Cheat Sheet — Ensemble Learning in ML 


What is Ensemble Learning? Wisdom of the crowd


Combine multiple weak models/learners into one predictive model to reduce bias, variance and/or improve accuracy.


Types of Ensemble Learning: N number of weak learners

1. Bagging: Trains N different weak models (usually of the same type, i.e. homogeneous) with N non-overlapping subsets of the
input dataset in parallel. In the test phase, each model is evaluated. The label with the greatest number of predictions is
selected as the prediction. Bagging methods reduce the variance of the prediction.


2. Boosting: Trains N different weak models (usually of the same type, i.e. homogeneous) with the complete dataset in a
sequential order. The datapoints wrongly classified by the previous weak model are given more weight so that they can
be classified properly by the next weak learner. In the test phase, each model is evaluated and, based on the test error of
each weak model, the prediction is weighted for voting. Boosting methods decrease the bias of the prediction.


3. Stacking: Trains N different weak models (usually of different types, i.e. heterogeneous) with one of the two subsets of the
dataset in parallel. Once the weak learners are trained, they are used to train a meta learner to combine their
predictions and carry out the final prediction using the other subset. In the test phase, each model predicts its label, and this set of
labels is fed to the meta learner, which generates the final prediction.
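The two voting schemes can be sketched as follows. The function names, labels, and alpha values are hypothetical and only illustrate how simple voting (bagging) and weighted voting (boosting) differ.

```python
from collections import Counter

def majority_vote(predictions):
    """Bagging-style aggregation: the most frequent label wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, alphas):
    """Boosting-style aggregation: each model's vote counts with its weight alpha."""
    totals = {}
    for label, alpha in zip(predictions, alphas):
        totals[label] = totals.get(label, 0.0) + alpha
    return max(totals, key=totals.get)

votes = ["cat", "dog", "cat", "dog", "dog"]      # one prediction per weak model
alphas = [2.0, 0.5, 2.0, 0.5, 0.5]               # assumed per-model weights
simple_result = majority_vote(votes)
weighted_result = weighted_vote(votes, alphas)
```

The same five predictions give different answers: three low-weight models outvote two in simple voting, but the two high-weight models win once the alphas are applied.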


The block diagrams, and comparison table for each of these three methods can be seen below. 


Ensemble Method — Bagging

Step #1: Create N subsets from the original dataset, one for each weak model
(Input Dataset → Subset #1, Subset #2, Subset #3, Subset #4)
Step #2: Train each weak model with an independent subset, in parallel
(Weak Model #1 to Weak Model #4)
Step #3: In the test phase, predict from each weak model and vote their predictions to get the final prediction
(Voting → Final Prediction)


Ensemble Method — Boosting

Input: complete dataset
Step #1: Assign equal weights to all the datapoints in the dataset (uniform weights)
Step #2a: Train a weak model with equal weights to all the datapoints (Train Weak Model #1)
Step #2b: Based on the final error on the trained weak model, calculate a scalar alpha. Use alpha to increase the weights of
wrongly classified points, and decrease the weights of correctly classified points
Step #3a: Train a weak model with adjusted weights on all the datapoints in the dataset (Train Weak Model #2)
Step #3b: Based on the final error on the trained weak model, calculate a scalar alpha. Use alpha to increase the weights of
wrongly classified points, and decrease the weights of correctly classified points
...
Step #(n+1)a: Train a weak model with adjusted weights on all the datapoints in the dataset
Step #n+2: In the test phase, predict from each weak model and vote their predictions weighted by the corresponding alpha to
get the final prediction (Voting → Final Prediction)


Parameter | Bagging | Boosting | Stacking
Focuses on | Reducing variance | Reducing bias | Improving accuracy
Nature of weak learners is | Homogenous | Homogenous | Heterogenous
Weak learners are aggregated by | Simple voting | Weighted voting | Learned voting (meta-learner)
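The boosting steps that compute a scalar alpha and re-weight the datapoints do not specify a formula. One common choice is the AdaBoost rule, sketched below. Using AdaBoost here is an assumption, as are the function name and the toy weights.

```python
import math

def adaboost_round(weights, is_correct):
    """One boosting round: compute alpha, then re-weight and normalize the datapoints."""
    error = sum(w for w, ok in zip(weights, is_correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1 - error) / error)  # assumes 0 < error < 1
    # Decrease weights of correctly classified points, increase the others
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]       # Step #1: equal weights
is_correct = [True, True, True, False]   # outcome of the first weak model
alpha, new_weights = adaboost_round(weights, is_correct)
```

After one round the single misclassified point carries half of the total weight, so the next weak model is pushed to classify it correctly.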


Ensemble Method — Stacking

Step #1: Create 2 subsets from the original dataset, one for training weak models and one for the meta-model
(Input Dataset → Subset #1: Weak Learners, Subset #2: Meta Learner)
Step #2: Train each weak model with the weak learner dataset
(Train Weak Model #1 to Train Weak Model #4)
Step #3: Train a meta-learner for which the input is the outputs of the weak models for the Meta Learner dataset
(Trained Weak Models #1 to #4 → Meta Learner)
Step #4: In the test phase, feed the input to the weak models, collect the output and feed it to the meta model. The output of the
meta model is the final prediction




Cheat Sheet — Autoencoder & Variational Autoencoder 


Context — Data Compression 
• Data compression is an essential phase in training a network. The idea is to compress the data so
that the same amount of information can be represented by fewer bits.


Auto Encoder (AE)
• Autoencoder is used to learn efficient embeddings of unlabeled data for a given network
configuration. It consists of two parts, an encoder and a decoder.
• The encoder compresses the data from a higher-dimensional space to a lower-dimensional space (also
called the latent space), while the decoder converts the latent space back to higher-dimensional space.
• The entire encoder-decoder architecture is collectively trained on the loss function which encourages
that the input is reconstructed at the output. Hence the loss function is the mean squared error
between the encoder input and the decoder output.
• The latent variable is not regularized. Picking a random latent variable will generate garbage output.
• The latent variables are deterministic values, and the latent space lacks generative capability.

[Figure: Autoencoder block diagram (input → encoder → latent space → decoder → reconstructed input) and a plot of the latent space of AE]

$\text{loss} = \|x - \hat{x}\|_2 = \|x - d_\phi(z)\|_2 = \|x - d_\phi(e_\theta(x))\|_2$


Variational Auto Encoder (VAE) 

• Variational autoencoder addresses the issue of non-regularized latent space in
autoencoder and provides the generative capability to the entire space.

• Instead of outputting the vectors in the latent space, the encoder of VAE outputs parameters of a
pre-defined distribution in the latent space for every input.

• The VAE then imposes a constraint on this latent distribution forcing it to be a normal distribution.

• The latent variable in the compressed form is mean and variance

• The training loss of VAE is defined as the sum of the reconstruction loss and the similarity loss (the
KL divergence between the unit gaussian and the encoder output distribution).

• The latent variable is smooth and continuous, i.e. random values of the latent variable generate
meaningful output at the decoder, hence the latent space has generative capabilities.

• The input of the decoder is sampled from a gaussian with mean/variance of the output of encoder.


[Figure: Variational Autoencoder block diagram (input → encoder → latent distribution $\mu_x, \sigma_x$ → sampling → latent vector → decoder → reconstructed input) and a plot of the latent space of VAE with KL loss]

$\text{reconstruction loss} = \|x - \hat{x}\|_2 = \|x - d_\phi(z)\|_2 = \|x - d_\phi(\mu_x + \sigma_x \epsilon)\|_2$
$\mu_x, \sigma_x = e_\theta(x), \quad \epsilon \sim \mathcal{N}(0, 1)$
$\text{similarity loss} = \text{KL Divergence} = D_{KL}(\mathcal{N}(\mu_x, \sigma_x) \,\|\, \mathcal{N}(0, 1))$


$\text{loss} = \text{reconstruction loss} + \text{similarity loss}$
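The sampling step and the similarity loss can be sketched without a neural network. The closed-form KL divergence below is the standard result for a diagonal Gaussian against a unit Gaussian. The function names and the list-based vectors are assumptions, and the encoder and decoder themselves are not modelled.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), applied per latent dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_unit_gaussian(mu, sigma):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, 1)) summed over latent dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))

def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction loss (L2 norm of x - x_hat) plus similarity loss (KL term)."""
    reconstruction = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    return reconstruction + kl_to_unit_gaussian(mu, sigma)

rng = random.Random(0)
z = reparameterize(mu=[0.0, 0.0], sigma=[1.0, 1.0], rng=rng)
```

The KL term is zero only when the encoder outputs mean 0 and standard deviation 1, which is what pulls the latent distribution toward the unit Gaussian.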




Cheat Sheet — Data Structures


1. List
• Ordered collection of elements
• The position of each element is defined by the index
• The elements can be accessed in any order
[Figure: a list with len = 4, showing each value and its index]


2. Linked List
• A linked list does not have its order defined by the physical placement of its elements in memory.
• Contiguous elements of the linked list are not placed adjacent to each other in the memory.
• Each linked list element contains both the value and the address (pointer) to the next linked list element.
• Hence the linked list can only be traversed sequentially, going through each element one at a time.
3. Stack

• Stack is a sequential data structure which maintains the order of
elements as they were inserted in.

• Last In First Out (LIFO) order, which means that the elements can
only be accessed in the reverse order as they were inserted into the
stack.

• The element inserted last will be the first one to get removed
from the stack.

• Push() adds an element at the head of the stack,
while pop() removes an element from the head of the stack

• A real-life example of a stack is a stack of kitchen plates


4. Queue

• A queue is a sequential data structure that maintains the order
of elements as they were inserted in

• First In First Out (FIFO): the element inserted first will be
the first one to get removed from the queue
• Whenever an element is added (Enqueue()) it is added to the
end of the queue. On the other hand, element removal
(Dequeue()) is done from the front of the queue.
• A real-life example is a check-out line at a grocery store
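Both structures map onto Python built-ins. The sketch below uses a list as the stack and `collections.deque` as the queue. The variable names are illustrative.

```python
from collections import deque

stack = []
for item in [1, 2, 3]:
    stack.append(item)      # push() onto the head of the stack
popped = stack.pop()        # pop() removes the most recently pushed element (LIFO)

queue = deque()
for item in [1, 2, 3]:
    queue.append(item)      # Enqueue() at the end of the queue
dequeued = queue.popleft()  # Dequeue() from the front of the queue (FIFO)
```

A `deque` is used for the queue because removing from the front of a plain list takes time proportional to its length.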


5. HashTable
• Creates paired assignments (key mapped to values) so the
pairs can be accessed in constant time

• For each (key, value) pair, the key is passed through a
hash function to create a unique physical address for the
value to be stored in the memory.

• Hash function can end up generating the same physical
address for different keys. This is called a collision.
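One common way to handle collisions is chaining, where each address holds a small list of (key, value) pairs. The class below is a toy sketch under that assumption. The class name, bucket count, and keys are illustrative, and Python's built-in `dict` should be preferred in practice.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining (a list per bucket)."""

    def __init__(self, n_buckets=4):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)  # the hash function picks the bucket

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(n_buckets=4)
table.put(1, "val1")
table.put(5, "val2")  # 5 % 4 == 1 % 4, so this key collides with key 1
```

Both values remain retrievable after the collision because lookups compare the stored keys inside the shared bucket.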


6. Tree


• Maintains a hierarchical relation between its elements.

• Root Node: The node at the top of the tree

• Parent Node: Any node that has at least one child

• Child Node: The successor of a parent node is known as a child node. A
node can be both a parent and a child node. The root is never a child node.

• Leaf Node: The node which does not have any child node.

• Traversing: Passing through the nodes in a certain order, e.g. BFS, DFS

[Figure: a tree with its root and a sub-tree highlighted]
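The two traversal orders can be sketched on a small tree stored as a dictionary from each node to its children. The node names and the dictionary representation are assumptions for illustration.

```python
from collections import deque

tree = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"], "c": [], "d": [], "e": []}

def bfs(tree, start):
    """Breadth-first: visit nodes level by level using a queue."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

def dfs(tree, node):
    """Depth-first (pre-order): visit a node, then each child's whole sub-tree."""
    order = [node]
    for child in tree[node]:
        order.extend(dfs(tree, child))
    return order

leaves = [node for node, children in tree.items() if not children]
bfs_order = bfs(tree, "root")
dfs_order = dfs(tree, "root")
```

BFS relies on a queue and DFS on the call stack, which ties the traversals back to the two sequential structures above.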


7. Graph

• A graph is a pair of sets (V, E), where V is the set of all the vertices and E is the set of all edges.

• The neighbors of a node are the set of all vertices connected with that node through an edge.

• As opposed to trees, a graph can be cyclic, which means starting from a node and
following the edges, you can end up on the same node
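A graph as a pair (V, E) and the notion of a cycle can be sketched as below. The union-find based cycle check applies to undirected graphs, and the helper names and sample edges are assumptions of this sketch.

```python
def neighbors(vertex, edges):
    """All vertices connected to `vertex` by an undirected edge."""
    return {b if a == vertex else a for a, b in edges if vertex in (a, b)}

def has_cycle(vertices, edges):
    """Detect a cycle in an undirected graph with union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # a and b were already connected: this edge closes a cycle
        parent[root_a] = root_b
    return False

V = {"A", "B", "C", "D"}
tree_edges = [("A", "B"), ("A", "C"), ("C", "D")]
cyclic_edges = tree_edges + [("D", "A")]
```

The first edge list forms a tree (no cycle), and adding the edge from D back to A creates the cycle A, C, D, A.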




Cheat Sheet — Preparing for Coding Interviews 


Part 1 — How to prepare for coding interviews?* 


• The timeline:
[Fig. 1 — Preparation Timeline for Coding Interviews: Start preparing → (1 month) → Start applying for jobs → (3 months) → Graduation]
• Review Data structures and Complexities:
The following 7 data structures, and their time/space complexity, are necessary for the interview
• List/Arrays, Linked List, Hash Table/dictionary, Tree, Graph, Heap, Queue
• Practice coding questions:
• Multiple online resources such as LeetCode.com, InterviewBit.com, HackerRank.com etc.
• Pick one online resource and aim for easy and medium coding questions (approx. 100-150).
• Beginners start preparing 2-3 months before the interview, and intermediates about 1 month.


• From my personal experience, paid subscription of LeetCode.com was worth it.
• Facebook, Uber, Google and Microsoft tagged questions on LeetCode covered almost 90% of the
questions asked


Part 2 — How to answer a coding question? * 
• Listen to the question
  The interviewer will explain the question with an example. Note down the important points.
• Talk about your understanding of the question
  Repeat the question and confirm your understanding. Ask clarifying questions about
  1. Input/output data type limitations
  2. Input size/length limitations
  3. Special/corner cases
• Discuss your approach
  Walk through how you would approach the problem and ask the interviewer whether they agree with it.
  Talk about the data structure you prefer and why. Discuss the solution in terms of the bigger picture.
• Start coding
  Ask the interviewer if you can start coding. Define useful functions and explain as you write.
  Think out loud so the interviewer can evaluate your thought process.
• Discuss the time and space complexity
  Discuss the time and space complexity of your coded approach in Big O terms.
• Optimize the approach
  If your approach is not the most optimized one, the interviewer will hint at a few
  improvements. Pay attention to the hints and try to optimize your code.
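
To make the last two steps concrete, here is a small sketch built on a hypothetical interview question that is not part of the original cheat sheet: "given a list of integers, is there a pair of elements that sums to a target?" The function names are illustrative. The first version is the kind of straightforward answer you might code first; the second is the kind of improvement an interviewer might hint at.

```python
def has_pair_brute_force(nums, target):
    """Check every pair: O(n^2) time, O(1) extra space."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return True
    return False


def has_pair_hash_set(nums, target):
    """Remember values seen so far: O(n) average time, O(n) extra space."""
    seen = set()
    for value in nums:
        if target - value in seen:
            return True
        seen.add(value)
    return False


print(has_pair_brute_force([2, 7, 11, 15], 9), has_pair_hash_set([2, 7, 11, 15], 9))
```

When discussing the two versions, it helps to state the trade-off explicitly: the hash set buys linear time at the cost of linear extra memory.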


[Figure: flow of steps. Ask clarifying questions, walk through your approach, start coding, discuss time and space complexity, optimize.]


Fig. 2 — How to answer a coding question? 


*Disclaimer: The recommendations are based on personal experiences of the author. The mentioned approach and resources might work great for some, but not so much for others. 




How to prepare for a behavioral interview? (1/4)

Collect stories, assign keywords, practice the STAR format


Keywords

List important keywords that will be populated with your personal stories. The most common keywords are given in the table below.

Conflict [...] | Negotiation | Compromise to [...] | Creativity | Flexibility | Convincing
Handling Crisis | Challenging Situation | Working with difficult people | [...] priorities not aligned | [...] to a colleague style | Take Stand
Handling -ve feedback | Coworker view of you | Working with a deadline | [...] | Your weakness | Influence Others
Handling failure | Handling unexpected situation | Converting challenge to opportunity | Decision without enough data | Conflict Resolution | Mentorship/Leadership


Stories 


1. List all the organizations you have been a part of. For example:
   1. Academia: BSc, MSc, PhD
   2. Industry: Jobs, Internships
   3. Societies: Cultural, Technical, Sports
2. Think of stories from step 1 that can fall into one of the keyword categories. The
more stories the better. You should have at least 10-15 stories.
3. Create a summary table by assigning multiple keywords to each story. This will help
you filter the stories when a question is asked in the interview. An example can be
seen below:
Story 1: [Convincing] [Take Stand] [Influence Others]
Story 2: [Mentorship] [Leadership]
Story 3: [Conflict Resolution] [Negotiation]
Story 4: [Decision without enough data]
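
If you prefer to keep the summary table in a script rather than on paper, it could be sketched as below. This is only an illustration, assuming the four example stories above; the names `story_keywords` and `build_keyword_index` are hypothetical. Inverting the table gives a keyword-to-stories lookup, which is the direction you need when a question is asked.

```python
from collections import defaultdict

# Summary table: each story is tagged with one or more keywords.
story_keywords = {
    "Story 1": ["convincing", "take stand", "influence others"],
    "Story 2": ["mentorship", "leadership"],
    "Story 3": ["conflict resolution", "negotiation"],
    "Story 4": ["decision without enough data"],
}


def build_keyword_index(table):
    """Invert {story: [keywords]} into {keyword: [stories]}."""
    index = defaultdict(list)
    for story, keywords in table.items():
        for keyword in keywords:
            index[keyword.lower()].append(story)
    return dict(index)


keyword_index = build_keyword_index(story_keywords)
print(keyword_index["negotiation"])
```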


STAR Format 


Write down the stories in the STAR format, as explained in part 2/4 of this cheat
sheet. This will help you practice organizing each story in a meaningful way.


Icon Source: www.flaticon.com 




How to prepare for a behavioral interview? (2/4)


Direct*, meaningful*, personalized*, logical* 


*(Respective colors are used to identify these characteristics in the example) 


Example: “Tell us about a time when you had to convince senior executives” 


Situation
Explain the situation and provide necessary context for your story.
“I worked as an intern at XYZ company in the summer of 2019. The project details provided to me were elaborate. After some initial brainstorming and research, I realized that the project approach could be modified to make it more efficient in terms of the underlying KPIs. I decided to talk to my manager about it.”

Task
Explain the task and your responsibility in the situation.
“I had an hour-long call with my manager and explained to him in detail the proposed approach and how it could improve the KPIs. I was able to convince him. He asked me if I would be able to present my proposed approach for approval in front of the higher executives. I agreed to it. I was working out of the ABC (city) office and the executives needed to fly in from the XYZ (city) office.”

Action
Walk through the steps and actions you took to address the issue.
“I did a quick background check on the executives to better understand their areas of expertise so that I could convince them accordingly. I prepared an elaborate 15-slide presentation, starting with explaining their approach, moving on to my proposed approach, and finally comparing them on preliminary results.”

Result
State the outcome of your actions.
“After some active discussion, we were able to establish that the proposed approach was better than the initial one. The executives proposed a few small changes to my approach and really appreciated my stand. At the end of my internship, I was one of the 3 out of 68 interns selected to meet the senior vice president of the company over lunch.”


How to answer a behavioral question? (3/4)


Understand, Extract, Map, Select and Apply 


Example: “Tell us about a time when you had to convince senior executives” 


1. Understand the question
   Example: A story where I was able to convince my seniors. Maybe they had something in mind,
   and I had a better approach and tried to convince them.

2. Extract keywords and tags
   Extract useful keywords that encapsulate the gist of the question.
   Example: [Convincing], [Creative], [Leadership]

3. Map the keywords to your stories
   Shortlist all the stories that fall under the keywords extracted in the previous step.
   Example: Story 1, Story 2, Story 3, Story 4, ..., Story N

4. Select the best story
   From the shortlisted stories, pick the one that best fits the question and has not been used
   so far in the interview.
   Example: Story 3

5. Apply the STAR method
   Apply the STAR method to the selected story to answer the question.
   Example: See Cheat Sheet 2/4 for details
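
The map and select steps can be thought of as a small ranking problem. The sketch below is one possible way to express it, not the author's method: it assumes "best" means "shares the most keywords with the question" and skips stories already told. The function name `select_story` and the sample table are hypothetical.

```python
def select_story(story_keywords, question_keywords, used_stories=()):
    """Pick the unused story sharing the most keywords with the question.

    Returns None when no unused story matches any keyword.
    Ties go to the story listed first in the table.
    """
    wanted = {k.lower() for k in question_keywords}
    used = set(used_stories)
    best_story, best_overlap = None, 0
    for story, keywords in story_keywords.items():
        if story in used:
            continue
        overlap = len(wanted & {k.lower() for k in keywords})
        if overlap > best_overlap:
            best_story, best_overlap = story, overlap
    return best_story


# Hypothetical summary table, for illustration only.
stories = {
    "Story 1": ["Convincing", "Take Stand"],
    "Story 2": ["Mentorship", "Leadership"],
    "Story 3": ["Convincing", "Creative", "Leadership"],
}
question = ["Convincing", "Creative", "Leadership"]
first_choice = select_story(stories, question)
second_choice = select_story(stories, question, used_stories=["Story 3"])
print(first_choice, second_choice)
```

In a real interview the judgment of relevance is of course qualitative; keyword overlap is only a rough stand-in for it.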




Behavioral Interview Cheat Sheet (4/4)


Summarizing the behavioral interview 


How to prepare for the interview

1. Gather important topics as keywords
   Understand and collect all the important topics commonly asked in the interview.
2. Collect your stories
   Based on all the organizations you have been a part of, think of all the stories that fall under the keywords above.
3. Assign keywords to stories
   Assign one or more keywords to each of your stories. This will help you recall them quickly.
4. Create a summary table
   Create a summary table mapping stories to their associated keywords. This will be used during the behavioral question.
5. Practice stories in STAR format
   Practice each story using the STAR format. You will have to answer the question following this format.

How to answer a question during the interview

1. Understand the question
   Understand the question and clarify any confusion that you have.
2. Extract the keywords
   Try to extract one or more of the keywords from the question.
3. Map the keywords to stories
   Based on the keywords extracted, find the stories using the summary table created during preparation (Step 4).
4. Select a story
   Since each keyword may be assigned to multiple stories, select the one that is most relevant and has not been used.
5. Apply the STAR format
   Once the story has been selected, apply the STAR format to the story to answer the question.


Follow the Author:


Follow the author for more machine learning/data science content at 


• Medium: https://aqeel-anwar.medium.com

• LinkedIn: https://www.linkedin.com/in/aqeelanwarmalik/


Feedback: 


If you find any error in the cheat sheets, please provide your feedback here 


Version History 
• Version 0.1.0.3 - Dec 25, 2021
  - Added cheat sheets: Autoencoder and Variational Autoencoder

• Version 0.1.0.2 - May 19, 2021
  - Added cheat sheets: Data Structures and Preparing for Coding Interviews
  - Added tutorial links at the end of each cheat sheet

• Version 0.1.0.1 - Apr 05, 2021
  - Fixed minor typos in the Bayes' Theorem, Regression analysis, Classifier, and PCA dimensionality reduction cheat sheets.

• Version 0.1.0.0 - Mar 30, 2021
  - Initial draft with nine basics of ML and two behavioral interview cheat sheets.